arXiv:1503.00323vl [stat.ML] 1 Mar 2015 


Sparse Approximation of a Kernel Mean 


Efren Cruz Cortes 

Department of Electrical Engineering and Computer Science 
University of Michigan, Ann Arbor, MI 

Clayton Scott 

Department of Electrical Engineering and Computer Science 
University of Michigan, Ann Arbor, MI 


February 2015 


Abstract 

Kernel means are frequently used to represent probability distribu¬ 
tions in machine learning problems. In particular, the well known kernel 
density estimator and the kernel mean embedding both have the form 
of a kernel mean. Unfortunately, kernel means are faced with scalabil¬ 
ity issues. A single point evaluation of the kernel density estimator, for 
example, requires a computation time linear in the training sample size. 
To address this challenge, we present a method to efficiently construct a 
sparse approximation of a kernel mean. We do so by first establishing an 
incoherence-based bound on the approximation error, and then noticing 
that, for the case of radial kernels, the bound can be minimized by solv¬ 
ing the fc-center problem. The outcome is a linear time construction of a 
sparse kernel mean, which also lends itself naturally to an automatic spar¬ 
sity selection scheme. We show the computational gains of our method 
by looking at three problems involving kernel means: Euclidean embed¬ 
ding of distributions, class proportion estimation, and clustering using the 
mean-shift algorithm. 


1 Introduction 

A kernel mean is a quantity of the form 

( 1 ) 

n 

1—1 

where (p is a, kernel and xi,... ,Xn € are data points. We define kernels 
rigorously below. Our treatment includes many common examples of kernels, 
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such as the Gaussian kernel, and encompasses both symmetric positive definite 
kernels and kernels used for nonparametric density estimation. 

Kernel means arise frequently in machine learning and nonparametric statis¬ 
tics as representations of probability distributions. In this context, Xi,... ,a;„ 
are understood to be realizations of some unknown probability distribution. 
The kernel density estimator (KDE) is a kernel mean that estimates the density 
of the data. The kernel mean embedding (KME) is a kernel mean that maps 
the probability distribution into a reproducing kernel Hilbert space. These two 
motivating applications of kernel means are reviewed in more detail below. 

This work is concerned with efficient computation of a sparse approximation 
of a kernel mean, taking the form 


( 2 ) 

1=1 

where € M and k := |{i : Ui ^ 0}| ^ n. In other words, given xi,... ,a;„, 
a kernel </), and a target sparsity k, we seek a sparse kernel mean ([^ that 
accurately approximates the kernel mean Q. This problem is motivated by ap¬ 
plications where n is so large that evaluation or manipulation of the full kernel 
mean is computationally prohibitive. A sparse kernel mean can be evaluated or 
manipulated much more efficiently. In the large n regime, the sparse approxi¬ 
mation algorithm itself must be scalable, and as we argue below, existing sparse 
approximation strategies are too slow. 

Our primary contribution is an efficient algorithm for sparsely approximating 
a kernel mean. The algorithm results from minimizing a sparse approximation 
bound based on a novel notion of incoherence. We show that in the context of 
kernel means based on a radial kernel (defined below), minimizing the sparse 
approximation bound is equivalent to solving the fc-center problem on xi,..., a;„, 
which in turn leads to an efficient algorithm. 

The rest of the paper is outlined as follows. In Sectionj^we review the KDE 
and KME, which motivate this work, and also introduce a general definition of 
kernel that encompasses both of these settings. Next, in Section we formu¬ 
late the problem of sparsely approximating a sample mean in an inner product 
space, followed by a review of related work in Section]^ where we also detail our 
contributions. In Section we establish an incoherence-based sparse approxi¬ 
mation bound. We then use the principle of bound minimization in Section 
to derive a scalable algorithm for sparse approximation of kernel means, with 
a sparsity auto-selection scheme presented in Section |6.I[ Finally, Section 
applies our methodology in three different machine learning problems that rely 
on large-scale KDEs and KMEs, and demonstrates the efficacy of our approach. 
A preliminary version of this work appeared in [1]. A Matlab implementation 
of our algorithm is available at [2]. 
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2 Motivating Applications 

Our work is motivated by two primary examples of kernel means. We review 
the KDE and KME separately, and then propose a general notion of kernel 
that encompasses the essential features of both settings and is sufficient for 
addressing the sparse approximation problem. By way of notation, we denote 
[n] := {!,...,n}. 

2.1 Kernel Density Estimation 

Let {xi,... ,Xn} C be a random sample from a distribution with density 
/. In the context of kernel density estimation, a kernel is a function (j) such 
that for all x', / (f){x,x')dx = 1. In addition, (f) is sometimes also chosen to be 
nonnegative, although this is not necessary for theoretical properties such as 
consistency. The kernel density estimator of / is the function 

i^[n] 

The KDE is used as an ingredient in a number of machine learning method¬ 
ologies. For example, a common approach to classihcation is a plug-in rule that 
estimates the class-conditional densities with separate KDEs In anomaly 

detection, a detector of the form /(x )<7 is commonly employed to determine 
if a new realization comes from / EISIIZIIH]. In clustering, the mean-shift al¬ 
gorithm forms a KDE and associates each data point to the mode of the KDE 
that is reached by hill-climbing [9]. 

Evaluating the KDE at a single test point requires 0{n) kernel evaluations, 
which is undesirable and perhaps prohibitive for large n. On the other hand, 
a sparse approximation with sparsity k requires only 0{k) kernel evaluations. 
This problem is magnified in algorithms such as mean-shift, where a (derivative 
of a) KDE is evaluated numerous times for each data point. In our experiments 
below, we demonstrate the computational savings of our approach in KDE- 
based algorithms for the embedding of probability distributions and mean-shift 
clustering. 

2.2 Kernel Mean Embedding of Distributions 

Let {xi,..., x„} C be a random sample from a distribution P. A symmetric 
positive definite kernel is a function x —>■ K that is symmetric and is 

such that all square matrices of the form [(j){xi, Xj)]2j^i are positive semidefinite. 
Every symmetric positive definite kernel is associated to a unique Hilbert space 
of functions called a reproducing kernel Hilbert space (RKHS), which can be 
thought of as the closed linear span of {(^(•,x) | x G K'^} [TD]. The RKHS has 
a property known as the revroducinq property which states that for all f in the 
RKHS, f{x) = {f,fii;x)). 
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The idea behind the kernel mean embedding is to select a symmetric positive 
definite kernel </>, and embed P in the RKHS associated with </> via the mapping 

'I'(P) := J (j){-,x)dP{x). 

Since P is unknown, this mapping is estimated via the kernel mean 

$(P) 

n ^' 

ie[n] 

The utility of the KME derives from the fact that for certain kernels, is in¬ 
jective. This permits the treatment of probability distributions as objects in 
a Hilbert space, which allows many existing machine learning methods to be 
applied in problems where probability distributions play the role of feature vec¬ 
tors [n m [13 HI]. For example, suppose that random samples of size n are 
available from several probability distributions Pi,, Pm. A KME-based algo¬ 
rithm will require the computation of all pairs of inner products of kernel mean 
embeddings of these distributions, li xi,... ,Xn ^ P and ,...,^ P', then 
(d>(P), dt(P')) = by the reproducing property. Therefore the 

calculation of all pairwise inner products of kernel mean embeddings requires 
0{N'^'n?) kernel evaluations. On the other hand, if we have sparse representa¬ 
tions of the kernel means, these pairwise inner products can be calculated with 
only 0{N'^k‘^) kernel evaluations, a substantial computational savings. In our 
experiments below, we demonstrate the computational savings of our approach 
in KME-based algorithms for the embedding of probability distributions and 
class-proportion estimation. 

2.3 Generalized Notion of Kernel 

The problem of sparsely approximating a sample mean can be addressed more 
generally in an inner product space. This motivates the following definition 
of kernel, which is satisfied by both density estimation kernels and symmetric 
positive definite kernels. 

Definition 1. We say that (/) : x —>■ K is a kernel if there exists an inner 

product space P such that for all x in x) G H. 

In the case of kernel density estimation, all commonly used kernels satisfy 
4>{-,x) G L'^(W^) for all x G Recalling that consists of equivalence 

classes of functions, when we write <p{-,x) G we view (j){-,x) as a repre¬ 

sentative of its equivalence class. In the case of the kernel mean embedding, we 
may simply take H to be the RKHS associated with (f. 

Our proposed methodology applies to kernels of a particular form, given by 
the following definition. 

Definition 2. We say (/) : x —>■ K zs a radial kernel if (j) is a kernel as in 
Def 0 and there exists a strictly decreasing function g : [0,oo) —>■ K such that, 
for all X, x' G 

{((>{■, x),4>{-,x’))n =5(||x-x'||2). 
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We now review some common examples of radial kernels. The Gaussian 
kernel with parameter ct > 0 has the form 



the Laplacian kernel with parameter 7 > 0 has the form 



and the Student-type kernel with parameters a, /? > 0 has the form 



The parameters Ca, c-y and Ca,p can be set to 1 for the KME, or so as to normalize 
to be a density estimation kernel, depending on the application. 

These examples illustrate that the space T-L such that (j>{-,x) G H is not 
unique. Indeed, each of these three kernels is a symmetric positive definite 
kernel, and therefore we may take H to be the RKHS associated with (j) [iniiis]- 
On the other hand, we may also select H — 

Each of these three examples is also a radial kernel. If we take H to be the 
RKHS, then by the reproducing property we simply have (/){■, x')) = 

cj){x,x'), and in each case, (j){x,x') = g{\\x — x'||) for some strictly decreasing 
g. These kernels are also radial if we take "H = For example, con¬ 

sider the Gaussian kernel, and let us write (j) = (jja- to indicate the dependence 
on the bandwidth parameter. Then {(j)„{-,x),(j)a{.‘-iX'))i ^2 = (j)^^{x,x'). Simi¬ 
larly, for the Student kernel with a = (1 + d)/2 (the Gauchy kernel), we have 
{(j)p{-,x), ^')) z ,2 = '(' 2 / 3 ( 2 ;, x'). For other kernels, although there may not be 

a closed form expression for g, it can still be argued that such a g exists, which 
is all we will need. 

3 Abstract Problem Formulation 

In the interest of generality and clarity, we consider the problem of sparsely 
approximating a sample mean in a more abstract setting. Thus, let {%, (•, •)) be 
an inner product space with induced norm and let {zi,..., z„} C H. For 

a G M", define ||a||g := |{i | ai ^ 0}|. Given an integer k < n, our objective is 
to approximate the sample mean 2 ^ Zi as a A:-sparse linear combination 

of zi,..., Zri. In particular, we want to solve the problem 


minimize 11 z — z^W-^ 
subject to ||a||g = k 


(3) 


where z^ = Sicfn] 
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Note that problem ([^ is of the form of the standard sparse approxima¬ 
tion problem m, where {zi,..., Zn} is the so-called dictionary out of which 
the sparse approximation is built. Later we argue that existing sparse approx¬ 
imation algorithms are not suitable from a scalability perspective. Instead, 
we develop an approach that leverages the fact that the vector being sparsely 
approximated is the sample mean of the dictionary elements. We are most in¬ 
terested in the case where Zi = and </) is a kernel, but the discussion in 

Section [ 5 ] is held in a more abstract sense. 

4 Related Work and Contributions 

Problem (|^ is a specific case of the sparse approximation problem. Since in gen¬ 
eral it is NP-hard many efforts have been made to approximate its solution in a 
feasible amount of time. See |16] for an overview. A standard method of approx¬ 
imation is Matching Pursuit. Matching Pursuit is a greedy algorithm originally 
designed for finite-dimensional signals. Following the notation of Problem (|^ 
let z be the target vector we wish to approximate. In Matching Pursuit the first 
step is to pick an “atom” in {zi,..., z„} which captures most of z as measured by 
the magnitude of the inner product. After this first step the subsequent atoms 
are iteratively chosen according to which one captures more of the portion of z 
that hasn’t been accounted for mj. Note that just the first step of this algo¬ 
rithm requires to compute, for each Zj, the quantity (z, Zi) = ^ 

Since we have n Zi’s, the first step already takes r2(n^) kernel evaluations, which 
is undesirable. Another common approach. Basis Pursuit, has similar time com¬ 
plexity. 

Several algorithms which focus specifically on the sparse KDE case have been 
developed. In m a clustering method is used to approximate the KDE at a 
point by rejecting points which fail to belong to close clusters. In jT9j a relevant 
subset of the data is chosen to minimize the error but at an expensive O(n^) 
cost. In nnum a regression based approach is taken to estimate the KDE 
through its cumulative density function. Notice these algorithms rely heavily 
on the assumption that the KDE represents a probability distribution, so cannot 
be generalized to other kernel means. 

When the kernel mean is thought of as a mixture model, the model can be 
collapsed into a simpler one by reducing the number of its components through 
a similarity based merging procedure [HI |H1 HI] • Since these methods necessi¬ 
tate the computation of all pairwise similarities, they present quadratic compu¬ 
tational complexity. EM algorithms for this task result in similar computational 
requirements [25l 1^ . 

A line of work which tries to speed up general kernel sums comes historically 
from n-body problems in physics, and makes use of fast multipole methods EH 
EH] ■ The general idea behind these methods is to represent the kernel in question 
by a truncated series expansion, and then use a space partitioning scheme to 
group points, yielding an efficient way to approximate group-group or group- 
point interactions, effectively reducing the number of kernel evaluations. These 
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methods are usually kernel-dependent and do not yield a valid density. For 
the case of the Gaussian kernel, see ESEn] for two different space partitioning 
methods. Note that, contrary to these methods, our approach can still yield a 
valid density (discussed below), and can therefore be used to estimate quantities 
like the KL divergence. 

The efforts of rapidly approximating general kernel based quantities have led 
to the use of e-samples, or coresets. To define e-samples, first denote the data 
A := {cci,... ,Xn} and the kernel quantity of interest Q(A,x), where x is some 
query point (for example, the KDE is Q(A, x) = ^ Eiefn] e-sample 

is a set A' G A such that, for every query point x, Q(A,x) and Q(A',x) differ 
by less than e with respect to some norm. See [SI1E2] for the KDE case with 
ioo norm. For other kernel quantities, in specific the KME using the RKHS 
norm, see [33j- Both cases allow for constructions of e-samples in near linear 
time with respect to the data size and 1/e. Notice that our approach has the 
advantage that it handles both the KDE and KME cases simultaneously, and 
that if desired it can yield a valid density as the approximation. 

Although most of the literature seems to concentrate on the KDE, there have 
also been efforts to speed up computation time in problems involving the KME. 
As in the e-sample approach above, many of these problems require the distance 
between KMEs in the RKHS, so they focus on speeding up this calculation. In 
[34] . for example, a fast method is devised for the specific case of the maximum 
mean discrepancy statistic used for the two-sample test. 

Computing the kernel mean at each of the original points {xi,... ,a:„} can 
be thought of as a matrix vector multiplication, where the matrix in question 
is the kernel matrix. Therefore, an algebraic approach to this problem consists 
of choosing a suitable subset of the matrix columns and then approximating 
the complete matrix only through these columns. Among the most common 
of these is the Nystrom method. In the Nystrom method the kernel matrix 
K is approximated by the matrix QW^Q'^, where Q is composed of a subset 
of the columns of K, W is those columns intersected with their corresponding 
rows, and IF+ the best r-rank approximation to its pseudoinverse (see [35] for 
details). The columns composing Q are typically chosen randomly under some 
sampling distribution. See |36j for some examples of sampling distributions. As 
explained in Section [5T] our approach is connected to the Nystrom method and 
can be viewed as a particular scheme for column selection tailored to kernel 
means. The Nystrom approximation of the kernel matrix is not the only one 
used though, and other algebraic approaches exist. In m for example, an 
interpolative decomposition of the kernel matrix is proposed. 

In [55] a “coherence” based sparsification criterion is used in the context 
of one-class classification. The main idea is that each set of possible atoms 
{zi\ai ^ 0} can be quantified by the largest absolute value of the inner product 
between two different atoms. The method proposed requires the computation of 
the complete kernel matrix, and is therefore not suitable for our setting, which 
involves large data. The motivation for their coherence criterion, however, lies 
in the minimization of a bound on the approximation error. As seen in Section 
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|5.2[ we propose a similar bound as a starting point for our algorithm. 

Contributions 

We list a summary of contributions in this paper. 

• We present a bound on the sparse approximation error based on a novel 
measure of incoherence. 

• We recognize that for radial kernels, minimizing the bound is equivalent 
to solving an instance of the fc-center problem. The solution to the k- 
center problem, in turn, can be approximated by a linear running time 
algorithm. 

• Our method for approximating the KDE can be implemented so that the 
sparse kernel mean is a valid density function, which is important for some 
applications. 

• Our method provides amortization of computational complexity since the 
calculation of the set I (introduced below) is only computed once. Many 
subsequent calculations (e.g., kernel bandwidth search) can then be per¬ 
formed at a relatively small or negligible cost. 

• Our method is flexible in that it addresses different types of kernel means. 
In particular, it can be used to approximate both KMEs and KDEs. 

• Our method provides a scheme to automatically select the sparsity level. 

• We demonstrate the improved performance of the proposed method in 
three different applications: Euclidean embedding of probabilities (us¬ 
ing both the KDE and the KME), class proportion estimation (using the 
KME), and clustering with the mean-shift algorithm (using the KDE). 

5 Subset Selection and Incoherence-Based Bound 

Let us now reformulate problem ([^. Our approach will be to separate the 
problem into two parts: that of finding the set of indices i such that at is not 
zero, and that of finding the value of the nonzero a^’s. Letting I C [n] denote 
an index set, we can pose problem (|^ as 

min min llz — > aizAl'^ . (4) 

IC[„] (a.)iez ^ 

Note that the inner optimization problem is unconstrained and quadratic, and 
its solution, which for fixed I and k we denote by ax € ; is 


where Ki = {{zi, Zj))^ and kx is the fc-dimensional vector with entries 

^ G2:. 

Let ax = (ax,i)igx ~ Then we can rewrite problem (|^ 

as 

min ||z-zx|| . (5) 

xC [nj 
\i\=k 

5.1 Connection to the Nystrom Method 

Before continuing to the approximate solution of problem ([^, we briefly high¬ 
light its relationship to the Nystrom method. Given a set Id [n], let K be the 
kernel matrix of {zi\i G [n]}, K := ((zi, Zj})ij^in], and Kx the kernel matrix of 
{zi\i G I}, Kx := ((zi, Zj})ij^x- Also, let Qx be the binary matrix such that 
KQx is composed of the columns of K corresponding to I. Then we can rewrite 
ax and Kx as ax = {Qx^Qi) Qx^^n and Kx = QxKQx, where denotes 
the vector in M" with entries 1/n. By doing so, we can express the objective of 
([^ as 

||0 - zxf = iUK- KQxK^^Q^K^) 1„ 

= ll (^K-Kx) 1„. 

where Kx := KQxK^^Q^K^. We recognize Kx as the Nystrom matrix from 
the Nystrom method [36] . which is the only term dependent on I in the ob¬ 
jective. Therefore, our work can be interpreted from the Nystrom perspective: 
choose suitable columns of K and approximate K through the Nystrom matrix. 
The main difference is that the resulting approximation is based on the induced 
norm of the inner product space where the z^’s reside, instead of the commonly 
used spectral and Frobenius norms. 

5.2 An Incoherence-based Sparse Approximation Bound 

We now present our proposed algorithm to approximate the solution of problem 
(HI- Our strategy is to find an upper bound on the term Ijz — zx|| which is 
dependent on I and then find the T that minimizes the bound. First, we present 
a lemma which will aid us in finding the bound. 

Lemma 1. Let ("H, (•,•)) be an inner product space. Let S be a finite dimen¬ 
sional subspace ofH and Ps the projection onto S. For any Zq GK 

ll^szoll = max {zo,z). 

zGS.||z||=1 

Proof. First note that since S is finite dimensional, by the Projection Theorem 
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^0 — Pszo is orthogonal to S. Now, for any z G S with ||2;|| = 1 , we have 


{zo, z) = {Pszo + {zo - Pszo): z) 

= {Pszo, z) + (zo - Pszq., z) 
= {Pszo,z) 

<||Ps^o|||kl| = ||P 5 ^o||, 


where we have used the Cauchy-Schwartz inequality. To confirm the existence 
of a vector z which makes it an equality and therefore reaches the maximum, 
just let z = Pszo/ llPfiZoll- □ 

We can now present the theorem which will be the basis for our minimization 
approach. First, define 

vx '■= min max (z^, Zj ), 
iGX 

which we can think of as a measure of the “incoherence” of {z^ \ i G I}. It is 
now possible to establish a bound: 

Theorem 1. Assume that for some C > Q {zi, Zi) = C G [n]. Then for every 
P C [n], 

Proof. The beginning of this proof is similar to the one in [3S]. Let Sx ■= 
span({zi I i G I}) and denote Pg^ the projection operator onto Sx and / the 
identity operator. We have 

||z - zxll = ||z- - Ps^-zW = -II E - PsMW 

i£[n] 

<^Eii(^-^^xKii = ^Eii(^-^«xKii 

2G[n] i^X 


where we have used the triangle inequality, and the last equality is due to the 
fact that Zi = PsiZi when Zi G Sx- 

Now, since (zi — PsxZi) -L PsiZi, we can use Pythagoras’ Theorem in TL to 
get \\zi - PsxZ^\\^ = Ikif - \\PsxZi f- 

By Lemmairi IIP5 z^H = max {zi,z). Therefore, for i 

zGSx,|P|| = l 


ll-PSz^zll 


1 

—= max 
VC zgSx,INII-v^ 


{Zi,z) 


> 

> 


1 . , 

mm max (, 

VC 


Zt) 
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Thus, for i ^ I, 


and finally 


- \\Ps, 


l^'T 

<C- PP 

c 


\\z- ziW < 






□ 


6 Bound Minimization Via /c-center Algorithm 

In this section we apply the previous result in the context of approximating 
a kernel mean based on a radial kernel. Recall that, in the kernel mean set¬ 
ting, Zi = (/){■,Xi) and = g{\\xi — Xj\\^), where (/) is a radial 

kernel, {xi,... ,Xn} C and g is strictly decreasing as in Definition]^ Also 
note that for any radial kernel the assumption in Theorem is satisfied, since 
{(!){■, Xi), (!){■, Xi)) = g{Q) = 0 0 . 

Define the set I* as 

I* := arg min max min \\xi — a;,|| . 

IC[n] iGl 

\I\=k 

Then, since is a radial kernel and g is strictly decreasing, I* also maximizes 
vx = min max g{\\xi — XjH). Therefore, T* is the set that minimizes the bound 

i GX 

in Theorem [2 We have translated a problem involving inner products of func¬ 
tions to a problem involving distances between points in 

The problem of finding I* is known as the /c-center problem. To pose the k- 
center problem more precisely, we make a few definitions. For a fixed I, let Xx = 
{xi I * S 1} and Yx = {xj \ j ^ I}, and for all Xj G Yx define its distance to Xx 
as d{xj,Xx) = min ||a;i —Xj||. Furthermore, let W{Xx) = max Xx). 

Xi^Xx Xj^Yz 

Therefore, the fc-center problem is that of finding the set I of size k for which 
IF(Xx) is minimized. 

The fc-center problem is known to be NP-complete [35] ■ However, there 
exists a greedy 2-approximation algorithm [40| which produces a set Ik such 
that W(Xx;.) < 2W{Xx-)- This algorithm is optimal in the sense that under 
the assumption that Pt^NP there is no p-approximation algorithm with p < 2 
m- The algorithm is described in Fig. and as can be seen, it has a linear 
time complexity in the size of the data n. In particular, the algorithm runs in 
0{nkd) time. 

6.1 Computation of ctj and Auto-selection of k 

The fc-center algorithm allows us to find the set I on which our approximation 
will be based. After finding I we can determine the optimal coefficients ax¬ 


il 




input xi,..., Xn, k 

X i — 0 

Y i - {Xi, . . .,Xn} 

Choose randomly a first index u € [n] 

X^XU{xu} 

Y ^ Y\{x^} 

while |X| < k do 

Choose the element y GY for which d(jj,X) is maximized 
X^XU{y} 

Y ^ Y\{y} 

end while 

output Ik = {i G [n] \ xiG X) 

Figure 1: A linear time 2-approximation algorithm for the fc-center problem. 


Since the main computational burden is in the selection of X, we now have the 
freedom to explore different values of ax in a relatively small amount of time. 
For example, we can compute ax for each of several possible kernel bandwidths 
tr. 

The optimal way to compute ax depends on the application. If the user 
has a good idea of what the value of k is, then a fast way to compute ax 
for that specific value is to apply their preferred method to solve the equation 
Kxax = Kx- For example, since for symmetric positive definite kernels the 
kernel matrix is positive semi-definite, the preconditioned conjugate gradient 
method can be used to quickly obtain ax to high accuracy. This approach has 
the advantages of being simple and fast. 

A further advantage of our method is evident when the user has access only 
to a maximum tolerance value of k, say k^ax, but desires to stop at a value 
ko < kmax which performs as well as kmax- To do this, at iteration m > 1 in the 
fc-center algorithm we compute ax^ right after computing which provides 
a record of all the ax^ for 1 < j < fcp. To find fcoi use the information from 
the computed coefficients to form an error indicator and stop when some error 
threshold is overcome. Before showing what these error indicators are, we first 
provide an update rule to efficiently compute the a coefficients at each iteration 
step. 

Let Xm be the set of the first m elements chosen by the fc-center algorithm, 
and let ax„ > ATx^ and kx„ be obtained by using Xm- If we increase the number 
of components to m -|- 1, then as shown in |38j we have 


Ki, 


.+1 




where Xj^ is the element selected by the fc-center algorithm, and b = {4>{xj ^_^_^, Xi))igx^ • 
The resulting update rule for the inverse is 




0 

0 


+ qo{qq*) 
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where qq = ^/{4’{xjm+n^jr,i+i) ~ 9 = ^xl From here 

the user can now compute ai^^i by multiplying with 


Ki™+i 


1 . 

.n Z-^i—1 




Assuming we stop at some k^ax, the time complexity for computing all the 
ai^’s is 0 (fc^£j 3 ,) and the necessary memory 0{k'^^^). 

To automatically stop at some kg < kmax we need a stopping criterion based 
on some form of error. We propose the following: using the notation of problem 
([^ we have that 



= Ikll^ “ 2 ■ X! “T* • ^ X! + a^Kxax 
= ll^f - 


Since ||z||^ is a constant independent of I, we can avoid its 0{n^) computation 
and only use the quantities E\x\ ■= —ct'xi'^x as error indicators. Note that Et is 
nonincreasing with respect to t. Based on this we choose fcg to be the first value 
at which some relative error is small. In this paper we used the test 

\Eko-i - Eko\ ^ 

\Ei-EkJ - 

for some small e. The overall complexity amounts to O{nkod + k^d). 

A further consideration for computing ax should be made if the result is 
desired to be a probability mass function. In this case a fc-dimensional ax can be 
projected into the simplex := |i/ G J2i=i Vi = l,Ui > 0 \/ 1 < i < fc| 

after being obtained by any of the discussed methods (see [H]). Alternatively, 
a quadratic program which takes into account the constraints of non-negativity 
and J2i=i = 1 can be solved. 

A Matlab implementation of the complete Sparse Kernel Mean procedure 
can be found at [2]- 


7 Experiments: Speeding Up Existing Kernel 
Mean Methods 

We have implemented our approach in three specific machine learning tasks 
that require the computation and evaluation of a mean of kernels. In the first 
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of these, we apply our algorithm to the task of dimensionality reduction. In 
the second, we use it in the setting of class proportion estimation. Finally, we 
explore its performance when used as part of the mean shift algorithm. 

In the following we refer to our algorithm or to the resulting kernel mean as 
SKM (for Sparse Kernel Mean). We now provide a detailed description of each 
task and relevant results. The implementation has been done in Matlab. 


7.1 Euclidean Embedding of Distribntions 

In this experiment we embed probability distributions in a lower dimensional 
space for the purpose of visualization. Given a collection of N distributions 
{Pi,..., Pat}, the procedure consists of creating a similarity matrix for some 
notion of similarity among these distributions and then performing a dimen¬ 
sionality reduction method. We consider two cases. In the first case the simi¬ 
larity matrix will be the distance between the kernel mean embeddings of the 
distributions in the RKHS (KME case), while in the second case it will be the 
(symmetrized) KL divergence between KDEs (KDE case). For dimensionality 
reduction we will use ISOMAP [13]. In the setup we have access to each of 
N distributions {Pi,... ,Pn} through samples drawn from those distributions. 

{ (P\'\ 

XI > 

Notice that in the KDE case, in order to compute the KL divergence it is 
necessary to obtain a valid density function. A particular advantage of our 
algorithm is that, by choosing the coefficients as described in Section |6T[ the 
resulting sparse approximation is a density function. 

Let us start with the KME case, in which the similarity matrix contains the 
norm of the difference between the distributions’ KMEs. The first task is to 
estimate the KME using some symmetric positive definite kernel (p. For the 
distribution, the empirical estimate of its KME is 

$(P,) = -^</)(-,xf)), 

1 — 1 

with a sparse approximation 

^o{P£)= 


for some set and € IR|, where the a coefficients have been com¬ 

puted according to the update method described in Section 6.1 

Given all the KMEs, we can now construct a distance matrix. Let P be the 
RKHS of (j). We can use the distance induced by the RKHS to create the matrix 
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D, with entries 




H 






Up, 




1/2 


We similarly define Dq based on the sparse KMEs. With such matrix ISOMAP 
can now be performed to visualize the distributions in, say, 

Note that if the samples from Pi and Pi' have ni and nii points, then Di^i' 
takes Q{n'j + nini' + n^/) time to compute. Since we need all the pairwise 
distances, we need Q{N'^) such computations. A sparse approximation of the 
KMEs of Pi and Pi' of sizes ki and ki' would instead yield a computation of 
Q(ki + kiki' + k'ji) for each entry. Assuming all samples have the same size n, 
and the sparse approximation size is fc, then the computation of the distance 
matrix is reduced from Q{N^n^) to Q{N^k^). 

Inspired by the work of |44j . we have performed these experiments on flow 
cytometry data from N = S7 cancer patients, with sample sizes ranging from 
8181 to 108343. We have used the Gaussian kernel, chosen H to be its RKHS, 
and computed the bandwidth based on the ‘iqr’ scale option in R’s KernSmooth 
package. That is, we have computed the interquartile range of the data, averaged 
over each dimension, and divided by 1.35. After the embedding has been done, 
we have performed Procrustes analysis on the points so as to account for possible 
translation and rotation, we also scaled by a suitable factor. 

To determine the maximum size ki of each sparse representation, we recall 
that the SKM procedure takes 0{niki + fc|) kernel evaluations, so in order to 
respect the nike factor, we have chosen a small multiple of for ki. In this 
case we picked ki to be the largest integer smaller than 3y/ni for each t. We have 
implemented the auto-selection scheme described in Section [6.1[ The results for 
the case of e = 10“^° are shown in Pig. and Table Although ki is the 
largest allowed sparsity, the algorithm stops at some km < ki. To determine 
how well Dq approximates D, we have plotted the relative error for 

different values of e, averaged over ten different runs. The result is shown in 
Fig. 13 


Table 1: Time comparison for the Euclidean embedding of the flow cytometry 
dataset - KME case._ 



/c-center 

D computation 

Total 

Full 

0 

8.1hrs 

8.1hrs 

SKM 

21.7mins 

1.4s 

21.7mins 
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Figure 2: 2-dimensional representation of flow cytometry data - KME case. 
Each point represents a patient’s distribution. The embeddings were obtained 
by applying ISOMAP to distances in the RKHS. 
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Figure 3: The relative error incurred by the SKM-based matrix Dq as a function 
of e, averaged over 10 runs - KME case. The average fc-center and Dq compu¬ 
tation times range from 2.4 to 24.6 minutes, and from 0.15 to 2.66 seconds, 
respectively. The average ratio kg/kmax ranges from 0.13 to 0.81. 
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The KDE case is similar. The similarity matrix is composed of the sym¬ 
metrized KL divergence between the KDEs of the distributions, defined as 
dKLip,q) '■= DKL{p\\q) + DKL{q\\p), where Dkl indicates the KL divergence. 
For the distribution, its KDE is 




with a sparse approximation 

foi= 

iGX(') 


for some set and > 0 , J2i which has again been cal¬ 

culated according to the update method described in Section O Note that 
the KL divergence requires two density functions as input, therefore it is im¬ 
portant to obtain a valid density. An advantage of our algorithm is that this 
is possible by obtaining the a coefficients and then projecting into the simplex 
as indicated in |6.1| As in the KME case, we construct the similarity matrix 


:= d-KLifi, ft)- 

To compute the KL divergence we split the data in two, use the first half for 
estimation of the KDE, and the second half for evaluation of the KL divergence. 
We have chosen kf, = for each i, as in the KME case, and used the same 
stopping criterion with e = 10“^°. The results for e = 10“^° are shown in Fig. 
I^and Table the plot of for several values of e is shown in Fig. 

Figs. and 1^ show us that the resulting embedded points using the sparse 
approximation keep the structure as of those using the full kernel means. Notice 
also from the Tables that the sparse approximation is many times faster than 
the full computation (about 20 times faster for each case). Furthermore, in the 
KME case, the main computational investment is made in finding the elements 
of the sets , since the subsequent computation of Dq is of negligible time. 


Table 2: Time Comparison for the Euclidean embedding of the Flow Cytometry 
dataset - KDE case. _ 



fc-center 

D computation 

Total 

Full 

0 

2.18 hrs 

2.18 hrs 

SKM 

5mins 

2mins 

7mins 


7.2 Class Proportion Estimation 

In this setting we are presented with labeled training data drawn from N dis¬ 
tributions {Pi ,..., Pn} and with further testing data drawn from a mixture of 
these distributions Pq = where tt^ > 0 and — 1- Our goal is 

to estimate the mixture proportions {tti, , ttat}. 
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Figure 4: 2-dimensional representation of flow cytometry data - KDE case. Each 
point represents a patient’s distribution. The embeddings were obtained by ap¬ 
plying ISOMAP to distances between KDEs as measured by the KL divergence. 
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Figure 5: The relative error incurred by the SKM-based matrix Dq as a func¬ 
tion of e, averaged over 10 runs - KDE case. The average fc-center and Dq 
computation times range from 1 to 7.25 minutes, and from 35 to 160 seconds, 
respectively. The average ratio fco/fcmax ranges from 0.22 to 0.9. 
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To do so we let Pi represent the KME of for 0 < ^ < TV. We then find 
the proportions that minimize the distance 

N 

where % is the RKHS of the kernel used to construct the KME. By setting 
the derivative to zero the optimal vector of proportions 7r_ := [tti, ..., tttv-i]^, 
subject to = 1 but not to tti > 0, satisfies 

Dtt_ = e, 

where 

= {Pi ~ Pni Pj — 

and 

Ci = (^Pi — Pat, Pq — ■ 

From here we can define 


TT 





A parallel approach, using the KDE instead of the KME is shown in [IS]. In 
that case the distance in P was changed to the distance. 

Notice we have not enforced the constraint TTi > 0, for 1 < i < A^. To do so a 
quadratic program can be set. For most of our simulations we did not encounter 
the necessity to do so. Therefore, for the few cases for which tt lied outside of 
the simplex, we have projected onto it as described in [42j . 

In our setup we have used the handwritten digits data set MNIST, obtained 
from [46|, which contains 60,000 training images and 10,000 testing images, 
approximately evenly distributed among its 10 classes (see [17] for details). We 
have only used the first five digits. 

We present a comparison of the performance, measured by the ii distance 
between the true tt and the estimate tt, of the sparse KME compared to the full 
KME. We have done this for different values of tt, meaning different locations 
of TT inside the simplex. To do so, we sampled tt from the simplex using the 
Dirichlet distribution with different concentration parameter w. As a reminder 
to the reader, a small value of w implies sparse values of tt are most probable, 
ui = 1 means any value of tt is equally probable, and a; > 1 means values of tt 
for which all its entries are of similar value are most probable. We varied w over 
the set {.1, .2,..., 3.1}. 

We have split the data in two and used the first half to estimate the kernel 
bandwidth through the following process. We first sample a true tt, then we 
construct the KME and pick the bandwidth a which minimizes Htt — 7r||^^. We 
performed the search on <t by using Matlab’s function fminbnd. For the SKM 
case we allowed for 200 iterations, while for the full KME case we only allowed 
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for 20 iterations since the computation time is expensive. We have used the 
Gaussian kernel, to create the sparse KME of the distribution, with sparsity 
level of ke = where ni is the size of the available sample from distribution 

i. Since the a coefficients depend on tr, and for each set we perform a search 
over several values of tr, we did not compute a iteratively as we constructed 
Instead, once the construction of was finished, we used the preconditioned 
conjugate gradient method to obtain a. 

Once cr was estimated, we then accessed the second half of the data to test 
the performance for both the SKM and the full KME for different values of uj. 
The results are shown in Eig. We have also plotted for perspective a “blind” 
estimation of tt, which uniformly at random picks a vector tt. A comparison of 
the computation times for the sparse KME and the full KME is shown in Table 
where we have averaged over all values of uj. 

Notice from Tablej^that, in the SKM case, the estimation of a takes about 
the same time as the computation of tt. This is due to the fact that the main 
bottleneck of the algorithm is the computation of the set X which is independent 
of tr. In the case of finding an optimal cr, we applied ten times more iterations 
than in the full KME case, while keeping the process ten times faster. 



Figure 6: Class Proportion Estimation. error of estimated proportions over 
a range of concentration parameters. 


7.3 Mean-Shift Clustering 

We have based this experiment on the mean-shift algorithm as described in [48] . 
This algorithm is used in several image processing tasks and we will use it in the 


20 







Table 3: Computation times for both full and sparse KME, averaged over all 
values of ui. _ 



a estimation 

TT computation 

Total 

Full 

481s 

26s 

8.45mins 

SKM 

47.5s 

48.6s 

1 . 6 mins 


context of image segmentation. The goal is to form a clustering of the image 
pixels into different segments. 

Each pixel is represented by a 5-dimensional vector (3 dimensions to describe 
color, and 2 for the position in the image), and the distribution of these feature 
vectors is estimated by the KDE. Denote the image pixels as Xi S 

K®. The mean-shift algorithm shifts each point lying on the surface of the 
density closer to its closest peak (mode). Given a starting point a;, the algorithm 
iteratively shifts x closer to its mode until the magnitude of the shift is smaller 
than some quantity 7 . The shift exerted on x at each iteration requires the 
computation of the gradient of the KDE at the current position, making mean- 
shift computationally expensive. Denote the shifted points as Once all 

points are shifted close to the different modes, then any clustering algorithm 
can be performed to find the clusters. A clustering algorithm is described in 
[i5] . based on merging the modes’ neighborhoods which are close. We used a 
code following these guidelines found at [JHl : slightly modified by increasing the 
distance used for modes’ neighborhoods to merge. 

In our experiments we used a 500 x 487 image of a painting by Piet Mondrian 
{Composition A), and compared our algorithm with the full density estimation 
case. We chose k^ax to be the largest integer smaller t han and we have used 
the method for auto-selecting fcp outlined in Section O with e = 10“®. We 
have used the Gaussian kernel and set the bandwidth according to Equation (18) 
in ED], which is specifically suggested for mode-based clustering. We compare 
the SKM approach to a method based on Locality Sensitive Hashing (LSH, see 
ED Eg). This method finds for each point and with high probability its nearest 
neighbors, it then approximates the KDE locally by only using the effect from 
such neighbors. We chose 5 nearest neighbors and to implement LSH we used 
the Matlab version of LSH available at [5S] (we have used the e21sh scheme with 
three hash tables per picture). See [53l[54| for details on LSH. 

We present two indicators to evaluate the performance between the cluster¬ 
ing resulting from the full KDE and that resulting from the approximate KDE. 
In the following, let B be used to indicate that the full kernel density estimate 
has been used, while A indicates either the SKM or the LSH approaches. With 
a slight abuse of notation, let A and B also indicate their resulting clusterings. 

Discrepancy Index. Our first performance measure, which we call the dis¬ 
crepancy index di^ is somehow intuitive, and it describes the ratio of the number 
of vectors xi, which the approximate methods shifted by more than 5 away from 
their full method counterpart. 5 is here some tolerance threshold, which we have 
set to three times the kernel bandwidth. More precisely, if indicate the 
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picture pixels and yf are the shifted versions of xi according to density 
estimation methods A and B respectively, then 

l 

Hausdorff Distance. The second performance measure, which describes the 
Hausdorff distance between clusterings, was obtained from and is denoted 
by dn- To define the Hausdorff distance, let P be a distribution on (in 
our case, P is the distribution of the image pixels on K®). Furthermore, let X 
be the set of subsets of such that the distance between two sets A and B is 
p{A, B) := P{AAB), where A is the symmetric difference (to be precise, we deal 
with equivalence classes, where two sets A and B are equivalent if p{A, B) = 0). 
Notice T is a metric space. Let B G X, and define p{A, B) := minsge p{A, B). 
We interpret a subset ^ of T as a clustering, and an element H in T as a cluster. 
The Hausdorff distance between two clusterings is 

dij (^, H) = max |max p{A,B), ^ax p(H,^)j-. 

In words, dn measures the furthest distance between elements of A to the 
clustering B and elements of B to the clustering A. That is, the less over¬ 
lap between clusters of different clusterings, as measured by P. Since we 
don’t have access to P, the empirical version of dn proposed in |55] is ob¬ 
tained by replacing P for the empirical probability measure. Letting p{A, B) := 
minsge i '^{AAB}{xi), we have 


dij(^, H) = max < max p{A,B), max p(B,A) 

(ylGyt BeB 

We use this latter quantity to measure the SKM performance. 

The results are presented in Table In the table B indicates the full kernel 
density estimate has been used, Askm indicates the fc-center based algorithm 
and Alsh the LSH setup. Note that both the SKM and the LSH approach 
present significant computational advantages. The SKM approach, however, 
manages to be faster while incurring half the discrepancy of the LSH and about 
the same Hausdorff distance. 


Table 4: Time and Performance Comparison for Mean Shift algorithm. 




Time 


Performance 


Preparation 

Mean Shift 

Total 

d^{■,B) 

dff(-,H) 

B 

0 

4hrs 

4hrs 

0 

0 

Askm 

3.26mins 

57s 

4.2mins 

0.018 

0.021 

Alsh 

14s 

4.2mins 

4.4mins 

0.034 

0.016 


22 










7.4 Other Simulations 


Unlike other methods for approximating a sum of kernels, the sparse approx¬ 
imation strategy proposed in this paper has the advantage that the resulting 
approximation can be a valid density if the a^’s are set to satisfy > 0 and 
cti = 1 . Therefore, we also evaluate the performance of the proposed sparse 
approximation according to the KL divergence, a common metric between dis¬ 
tributions whose arguments must be density functions. Notice in particular 
that other KDE approximation methods like the Improved Fast Gauss Trans¬ 
form and the LSH-based approach described in Section |7.3| are not applicable 
since they don’t return valid densities. 

For 11 distinct benchmark data sets, listed in Table we computed the KL 
divergences D{z\\zx) and D{zx\\z) between the sparse and the full kernel mean. 
We used the auto-selection scheme proposed in Section [6T| and projected the 
resulting a onto the simplex to ensure we have a valid probability distribu¬ 
tion. We have chosen a Gaussian kernel and used the Jaakkola heuristic [55] 
to compute the bandwidth. To place the performance of our approximation in 
perspective, we have also computed the KL divergences for a sparse approxima¬ 
tion based on choosing the set I uniformly at random. We have performed the 
Wilcoxon rank test m to determine if there is a significant advantage of the 
SKM. The test for both the case D{z\\zx) and the case Yi[zx\\z) yields a p-value 
of 0.0186, favoring the SKM method. The results are shown in Table 

To further illustrate the perform ance of SKM, we look at the error quantities 
E\x\ = 11^ ~— ll^ll^ (see Section 6.11 as the size of X increases. Fig. M shows 
a plot of E\x\ against the size of X for ihe banana data set. As a baseline, we 
have plotted alongside the same error for an approximation based on choosing 
the set X uniformly at random. Since we want to explore how fast \\z — zx\\^ 
approaches zero, we allowed kmax = n but used the auto-selection scheme to 
stop at an earlier /cg with tolerance threshold e = 10“®. We averaged 100 times 
and, at each iteration, we completed the graph by letting E\x\ = Ekg for \X\ > feg- 
The average SKM run stopped at fco=197, and the random sampling comparison 
at fcg = 240. The random approximation shows an initial advantage because 
it is more likely to pick elements from dense areas, which for small values of X 
represents better the full distribution. However, as the size of X increases the 
fine structure (e.g., the distribution tails) is better captured by SKM, since the 
fc-center algorithm picks points far apart from each other. 


8 Conclusion 

We have provided a method to rapidly and accurately build a sparse approx¬ 
imation of a kernel mean. We derived an incoherence based bound on the 
approximation error and recognized that, for radial kernels, its minimization is 
equivalent to solving the fc-center problem on the data points. If desired, our 
construction of the sparse kernel mean may be slightly modihed to provide a 
valid density function, which is important in some applications. Furthermore, 
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Table 5: Values of D(z||zx) and T){zx\\z) for different data sets. 



D(z||zi) 

D(zi||z) 


Random 

SKM 

Random 

SKM 

banana 

0.092597 

0.001805 

0.129183 

0.001613 

image 

0.451205 

0.041305 

0.212585 

0.061584 

ringnorm 

0.003983 

0.031736 

0.009253 

0.02853 

breast-cancer 

0.358253 

0.002546 

0.345895 

4.56E-05 

heart 

0.001918 

6.35E-16 

0.005228 

2.91E-16 

thyroid 

0.177317 

0.000594 

0.034616 

0.000289 

diabetes 

0.031366 

0.005474 

0.014635 

0.000102 

german 

0.008711 

0.003855 

0.008742 

0.00203 

twonorm 

0.000131 

0.000243 

4.59E-05 

0.000372 

waveform 

0.011473 

0.000177 

0.015064 

0.000404 

iris 

0.043924 

0.000395 

0.022519 

0.000104 



k = \I\ 

Figure 7: Comparison of E\x\ between the random algorithm and the /c-center 
algorithm for the banana data set. 
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the algorithm works for both kinds of kernel means: the KDE and the KME. 
Our method also naturally lends itself to a sparsity auto-selection scheme. 

We showed its computational advantages and its performance qualities in 
three specific applications. First, Euclidean embedding of distributions (for 
both KDE and KME), in which, for the KDE case, a valid density is needed 
to compute the KL divergence. Second, class proportion estimation (for the 
KME), which presents the amortization advantages of the SKM approach, in 
this case with respect to the bandwidth a. Finally, mean-shift clustering (for the 
KDE), in which with less computation time than the LSH-based approach, it 
performs better with respect to the discrepancy index and similar with respect 
to the Hausdorff distance. In most instances the proposed sparse kernel mean 
method has shown to be orders of magnitude faster than the approach based 
on the full kernel mean. 
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